Report crashing OTEL process cleanly with proper status reporting #11448
base: main
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
💚 Build Succeeded
cc @blakerouse
swiatekm left a comment
The logic looks good to me, but I have some concerns about deadlocks in the otel manager caused by trying to report status from the main loop.
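To illustrate the hazard, here is a contrived Go sketch (not the otel manager's actual code) of how a loop that is the sole reader of an unbuffered channel deadlocks itself by also sending on it:

// Contrived sketch of the deadlock risk, not the otel manager's real code:
// the run loop is the only reader of the unbuffered channel, so a send
// from inside that same loop blocks forever.
package main

func main() {
	updates := make(chan string) // unbuffered

	for {
		select {
		case u := <-updates:
			_ = u // handle a status update
		default:
			// Nothing can drain the channel while this goroutine blocks
			// on the send, so the loop deadlocks itself here.
			updates <- "status"
		}
	}
}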
exitCode := 0
err = cmd.RunCollector(ctx, nil, true, "debug", monitoringURL)
if err != nil && !errors.Is(err, context.Canceled) {
	fmt.Fprintln(os.Stderr, err)
FYI, if you use otelcol builder, its generated main.go uses log.Fatalf here. We should do the same if we can.
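A minimal sketch of that suggestion, with a hypothetical runCollector standing in for cmd.RunCollector from the diff above:

// Hypothetical sketch of the suggestion; runCollector stands in for
// cmd.RunCollector and the message text is assumed, not taken from the PR.
package main

import (
	"context"
	"errors"
	"log"
)

func runCollector(ctx context.Context) error { return nil } // placeholder

func main() {
	ctx := context.Background()
	if err := runCollector(ctx); err != nil && !errors.Is(err, context.Canceled) {
		// log.Fatalf prints the error and exits non-zero, matching the
		// pattern the otelcol builder uses in its generated main.go.
		log.Fatalf("collector run finished with error: %v", err)
	}
}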
// pipelines
for pipelineID, pipelineCfg := range c.Service.Pipelines {
	for _, recvID := range pipelineCfg.Receivers {
		instanceID := componentstatus.NewInstanceID(recvID, component.KindReceiver, pipelineID)
I think it's worth commenting that the upstream graph building code creates a single instance id for each receiver, containing information about all the pipelines it appears in. The status reporting then makes a copy of the status for each pipeline, so it's fine for us to do so here as well, but it's worth calling out imo.
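Something like the following comment on the loop would capture that (the wording is a suggestion only; the loop body is from the diff above):

// NOTE: upstream graph building creates a single InstanceID per receiver
// that records every pipeline it appears in, and status reporting then
// copies that status out per pipeline. Creating one InstanceID per
// (receiver, pipeline) pair here yields the same per-pipeline statuses.
for pipelineID, pipelineCfg := range c.Service.Pipelines {
	for _, recvID := range pipelineCfg.Receivers {
		instanceID := componentstatus.NewInstanceID(recvID, component.KindReceiver, pipelineID)
		// ...
	}
}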
forIDStr := fmt.Sprintf("for id: %q", instanceID.ComponentID().String())
failedMatchStr := fmt.Sprintf("failed to start %q %s:", instanceID.ComponentID().String(), strings.ToLower(instanceID.Kind().String()))
It'd be good to specify what conditions these lines refer to. The one about starting the component is relatively obvious, but forIDStr, not so much. Is this an error about the otel collector not knowing about the component type?
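For reference, this is what the two match strings expand to for an example component (purely illustrative; the ids here are invented):

// What the two match strings expand to for an example component.
package main

import (
	"fmt"
	"strings"
)

func main() {
	componentID := "elasticsearch/_agent-component/monitoring"
	kind := "Exporter"

	forIDStr := fmt.Sprintf("for id: %q", componentID)
	failedMatchStr := fmt.Sprintf("failed to start %q %s:", componentID, strings.ToLower(kind))

	fmt.Println(forIDStr)       // for id: "elasticsearch/_agent-component/monitoring"
	fmt.Println(failedMatchStr) // failed to start "elasticsearch/_agent-component/monitoring" exporter:
}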
// reportStartupErr maps this error to the *status.AggregateStatus.
// this is done by parsing the `m.mergedCollectorCfg` and converting it into the best effort *status.AggregateStatus.
func (m *OTelManager) reportStartupErr(ctx context.Context, err error) {
One thing I don't like about this function is that it pretends to report an error but actually reports a status. We have different delivery requirements for these. Reporting errors is non-blocking, as a newer error can always overwrite an older error. The same is not true of statuses: we need to make sure all the component statuses are delivered, or else we get bugs like the one in #10675.
As a result, reportStartupErr needs to be called with care, as it can potentially deadlock the manager. I wonder if we should bite the bullet and just make the update channel buffered with some reasonable size, and emit a fatal error if it can't be written to. What do you think?
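A minimal sketch of that buffered-channel idea; the field names, buffer size, and AggregateStatus stub below are assumptions, not the PR's actual API:

// Minimal sketch of the buffered-update idea with invented names.
package main

import "log"

type AggregateStatus struct{} // stub for *status.AggregateStatus

type OTelManager struct {
	updateCh chan *AggregateStatus // buffered, e.g. make(chan *AggregateStatus, 64)
}

// trySendStatus never blocks the manager's main loop. Statuses, unlike
// errors, must all be delivered, so a full buffer is treated as fatal
// rather than silently dropping an update.
func (m *OTelManager) trySendStatus(st *AggregateStatus) {
	select {
	case m.updateCh <- st:
	default:
		log.Fatalf("otel manager: status update channel full, cannot guarantee delivery")
	}
}

func main() {
	m := &OTelManager{updateCh: make(chan *AggregateStatus, 64)}
	m.trySendStatus(&AggregateStatus{})
}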
I ran the collector with the following configuration which should not run:
outputs:
  default:
    type: elasticsearch
    hosts: [127.0.0.1:9200]
    api_key: "example-key"
    #username: "elastic"
    #password: "changeme"
    preset: balanced
    otel:
      exporter:
        not_a_setting: true

It does not run, as expected, and the collector exits with the following output. It looks a lot like a multi-line error, but we manage to grab the last line, which at least contains the configuration key name:
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'exporters' error reading configuration for \"elasticsearch/_agent-component/monitoring\": decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' decoding failed due to the following error(s):","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2025-11-27T19:32:59.575Z","message":"'' has invalid keys: not_a_setting","ecs.version":"1.6.0"}❯ sudo elastic-development-agent status
┌─ fleet
│ └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
├─ status: (DEGRADED) 1 or more components/units in a failed state
├─ beat/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ beat/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ beat/metrics-monitoring-metrics-monitoring-beats
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ filestream-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ filestream-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ filestream-monitoring-filestream-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ http/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ http/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ http/metrics-monitoring-metrics-monitoring-agent
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
├─ prometheus/metrics-monitoring
│ ├─ status: (FAILED) FAILED
│ ├─ prometheus/metrics-monitoring
│ │ └─ status: (FAILED) '' has invalid keys: not_a_setting
│ └─ prometheus/metrics-monitoring-metrics-monitoring-collector
│ └─ status: (FAILED) '' has invalid keys: not_a_setting
└─ extensions
├─ status: StatusFatalError ['' has invalid keys: not_a_setting]
└─ extension:healthcheckv2/bcc9882f-6e23-4ec6-b6a7-0abecb4c2ded
└─ status: StatusFatalError ['' has invalid keys: not_a_setting]
That looks correct enough, except that we leak the healthcheck extension status verbatim (but not beatsauth or the diagnostics extension), which we probably shouldn't do (or at least it is inconsistent).
// The flow of this function comes from https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/internal/graph/graph.go
// It's a much simpler version, but follows the same for loop ordering and building of connectors of the internal
// graph system that OTEL uses to build its component graph.
func otelConfigToStatus(cfg *confmap.Conf, err error) (*status.AggregateStatus, error) {
This is complex enough to have its own tests. The configurations in manager_test.go right now don't cover it all, e.g. there are no configurations with connectors.
To me it seems like tests for this should focus on mapping status for complex configurations and the tests in manager_test.go can focus purely on ensuring various failure scenarios result in capturing a status regardless of what the configuration is.
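A hypothetical shape for such a dedicated test, using a connector fixture (the configuration, package name, and assertions are illustrative, not from the PR):

package otel

import (
	"errors"
	"testing"

	"go.opentelemetry.io/collector/confmap"
)

// Hypothetical dedicated test for otelConfigToStatus with a connector
// appearing as an exporter in one pipeline and a receiver in another.
func TestOtelConfigToStatus_Connectors(t *testing.T) {
	cfg := confmap.NewFromStringMap(map[string]any{
		"receivers":  map[string]any{"otlp": nil},
		"exporters":  map[string]any{"debug": nil},
		"connectors": map[string]any{"forward": nil},
		"service": map[string]any{
			"pipelines": map[string]any{
				"traces/in":  map[string]any{"receivers": []any{"otlp"}, "exporters": []any{"forward"}},
				"traces/out": map[string]any{"receivers": []any{"forward"}, "exporters": []any{"debug"}},
			},
		},
	})

	st, err := otelConfigToStatus(cfg, errors.New("startup failed"))
	if err != nil {
		t.Fatalf("otelConfigToStatus: %v", err)
	}
	if st == nil {
		t.Fatal("expected a non-nil aggregate status covering both pipelines")
	}
}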
What does this PR do?
At the moment, when a spawned OTEL subprocess fails, it is just reported as exit code 1. This provides no information about what has failed and marks the entire Elastic Agent as failed.
This changes that behavior by looking at the actual output to determine the issue and reporting a proper component status for the entire configuration. It does this by parsing the configuration and building its own aggregated status for the component graph, matching what the healthcheckv2 extension would return if it could run successfully. It inspects the error message to determine whether it can correlate the error to a specific component in the graph; if it cannot, it falls back to reporting the error on all components.
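A compressed sketch of that correlate-or-fail-everything behavior, with invented local types standing in for the PR's *status.AggregateStatus handling:

// Compressed sketch of the behavior described above; the types and the
// correlate helper are invented stand-ins, not the PR's implementation.
package main

import (
	"errors"
	"fmt"
	"strings"
)

type aggStatus struct {
	components map[string]error // component id -> failure, nil if healthy
}

// correlate marks only the named component as failed when the error text
// mentions its id, and otherwise falls back to failing every component.
func correlate(agg *aggStatus, runErr error) {
	for id := range agg.components {
		if strings.Contains(runErr.Error(), id) {
			agg.components[id] = runErr
			return
		}
	}
	for id := range agg.components {
		agg.components[id] = runErr
	}
}

func main() {
	agg := &aggStatus{components: map[string]error{
		"otlp": nil, "elasticsearch/monitoring": nil,
	}}
	correlate(agg, errors.New(`failed to start "elasticsearch/monitoring" exporter`))
	fmt.Println(agg.components)
}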
Why is it important?
The Elastic Agent needs to provide clean status reporting even when the subprocess fails to run, and it must not mark the entire Elastic Agent as failed when that happens.
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding change to the default configuration files
[x] I have added an entry in ./changelog/fragments using the changelog tool
[ ] I have added an integration test or an E2E test (covered by unit tests)

Disruptive User Impact
None
How to test this PR locally
Use either an invalid OTEL configuration or one that will fail to start. Observe that when running elastic-agent run with that OTEL configuration in the elastic-agent.yml file (aka Hybrid Mode), elastic-agent status --output=full provides correct state information.

Related issues